TURBO DECODER AND TURBO INTERLEAVER



TECHNICAL FIELD

The present invention relates to a data processing system, and more particularly to
a turbo decoder and a turbo interleaver.

BACKGROUND OF THE INVENTION

Data signals, in particular those transmitted over a typically hostile RF interface
(communication channel), are susceptible to error (channel noise) caused by the interface.
Various methods of error correction coding have been developed in order to minimize the
adverse effects that a hostile interface has on the integrity of communicated data. This is
also referred to as lowering the Bit Error Rate (BER), which is generally defined as the
ratio of incorrectly received information bits to the total number of received information
bits. Error correction coding generally involves representing digital data in ways
designed to be robust with respect to bit errors. Error correction coding enables a
communication system to recover original data from a signal that has been corrupted.
Two types of error correction code are a convolutional code and a parallel concatenated
convolutional code (so-called turbo code).

A convolutional code transforms an input sequence of bits into an output
sequence of bits through the use of a finite-state machine, where additional bits are added
to the data stream to provide error-correction capability. In order to increase
error-correction capability, the amount of additional bits added and the amount of
memory present in the finite-state machine need to be increased, which
increases decoding complexity.

In the turbo coding system, a block of data may be encoded with a particular
coding method, resulting in systematic bits and two sets of parity bits. To obtain the
second set, the original block of input data is rearranged with an interleaver and then
encoded with the same method as that applied to the original input data to generate the
first set of parity bits. Encoded data (systematic bits and parity bits) are combined in
some manner to form a serial bit stream and transmitted through the communication
channel to a turbo decoding system. Turbo decoding systems operate on noisy versions of
the systematic bits and the two sets of parity bits in two decoding stages to produce an
estimate of the original message bits. The turbo decoding system uses an iterative
decoding algorithm and consists of interleaver and deinterleaver stages, individually
matched to constituent decoding stages. The decoding stages of the turbo decoding
system may use the BCJR algorithm, which was originally invented by Bahl, Cocke,
Jelinek, and Raviv (hence the name) to solve a maximum a posteriori probability (MAP)
detection problem. The BCJR algorithm is a MAP decoder in that it minimizes the bit
errors by estimating the a posteriori probabilities of the individual bits in a code word.
To reconstruct the original data sequence, the soft outputs of the BCJR algorithm are
hard-limited. The decoding stages exchange with each other the obtained soft output
information, and iteration of decoding is ceased when a satisfactory estimate of the
transmitted information sequence has been achieved.

As the turbo code has impressive performance, which is very close to
Shannon capacity limits, the 3G mobile radio systems such as W-CDMA and cdma2000
have adopted turbo codes for channel coding.

3G wireless systems support a variable bit rate, which may result in full
reconstruction of the turbo interleaver at every 10 ms or 20 ms frame. Accordingly,
generating the whole interleaved address pattern at once consumes much time and
requires a large-sized RAM to store the pattern.

Accordingly, a high-speed turbo interleaver which can support a variable bit rate
and that does not affect the performance of the turbo coder is required.

As is well-known, W-CDMA and cdma2000 transmission schemes are different
in coding rate and interleaving. For example, the coding rate of W-CDMA
can be 1/2, 1/3, 1/4 or 1/5 but the coding rate of cdma2000 is 1/3, and the frame size of
the W-CDMA is one of twelve numbers 378, 570, 762, ..., and 20730, but that of the
cdma2000 is an integer between 40 and 5114, and the row size of the block
interleaver in W-CDMA is 32 (8-14 of them are unused) but that of the cdma2000 can
be 5, 10, or 20.

Accordingly, a flexible and programmable decoder is required for 3G
communication because global roaming is recommended between different 3G standards
and the frame size may change on a frame basis.



SUMMARY OF THE INVENTION

Embodiments of the present invention provide an interleaver. The interleaver
comprises a preprocessing means for preparing seed variables and an address generation
means for generating interleaved addresses using the seed variables on the fly.
The seed variables are in the form of column vectors whose number of elements is equal to
the number of rows of the two-dimensional block interleaver. Consequently, the
number of seed variables is less than the size of a data block. If a generated
interleaved address is larger than the size of the data block, the generated interleaved
address is discarded.

In some embodiments, the seed variables include a base column vector, an
increment column vector and a cumulative column vector. The number of elements of all
three column vectors is equal to the number of rows of the interleaver block. The
cumulative column vector is updated by adding the increment vector to an old cumulative
column vector after interleaved addresses for one column are generated by adding the
base column vector and the cumulative column vector. When updating the cumulative
column vector, elements of the updated cumulative column vector that are larger than the
number of columns in the data block are reduced by the number of columns in the
data block.

Elements of the base column vector and the increment column vector
are arranged in the order of the inter-row permutation.

Embodiments of the present invention provide a turbo decoding system. The
turbo decoding system comprises an interleaver comprising a preprocessing means for
preparing seed variables and an address generation means for generating interleaved
addresses using the seed variables, an address queue for storing a generated interleaved
address equal to or smaller than the interleaver size, an SISO decoder performing
recursive decoding and calculating a log likelihood ratio, and an LLR memory connected
to the SISO decoder and storing the log likelihood ratio, wherein the SISO decoder
accesses the input data and the log likelihood ratio alternately in a sequential order and in
an interleaved order using the generated interleaved address.

In some embodiments, the generated interleaved address is reused as a write
address for writing the log likelihood ratio outputted from the SISO decoder into the LLR
memory.

Embodiments of the present invention provide a turbo decoding system
comprising a processor for generating interleaved addresses and controlling hardware
blocks, an address queue for storing the generated interleaved addresses, a buffer
memory block including an LLR memory for storing a log likelihood ratio and a plurality
of memory blocks for storing soft inputs, and an SISO decoder connected to the buffer
memory block, the SISO decoder including an ACSA network for calculating the log
likelihood ratio recursively from soft inputs and the log likelihood ratio provided by the
LLR memory, and a plurality of memory blocks for storing intermediate results of the
ACSA network.

In some embodiments, the processor prepares seed variables when the
interleaver structure changes due to a change of the coding standard or bit rate, and
generates the interleaved addresses column by column using the seed variables by simple
add and subtract operations when the interleaved addresses are required.

In some embodiments, the SISO decoder supports a Viterbi decoding mode. In
the Viterbi decoding mode, the ACSA network performs a Viterbi recursion, the LLR
memory stores traceback information outputted by the ACSA network, the processor
performs traceback from the traceback information read from the LLR memory,
and one of the memories of the SISO decoder stores a path metric outputted by
the ACSA network.

In some embodiments, the processor is a single-instruction and multiple-data
(SIMD) processor. Preferably, the SIMD processor includes five processing elements,
wherein one of the five processing elements controls the other four processing elements,
processes scalar operations, and fetches, decodes, and executes instructions including
control and multi-cycle scalar instructions, and wherein the other four processing
elements only execute SIMD instructions.

Embodiments of the present invention provide an interleaving method for
rearranging a data block in a data communication system. The interleaving method
comprises preparing seed variables and generating interleaved addresses on the fly using
the seed variables.

BRIEF DESCRIPTION OF THE DRAWINGS

Other aspects of the present invention will be more readily understood from the
following detailed description of the invention when read in conjunction with the
accompanying drawings, in which:
Fig. 1 illustrates a basic turbo-encoding system.
Fig. 2 illustrates a typical eight-state RSC encoder of Fig. 1.
Fig. 3 shows a block diagram of the turbo decoding system.
Fig. 4 shows an extrinsic form of the turbo decoding system of Fig. 3.
Fig. 5 schematically shows a block diagram of a time-multiplex turbo decoding
system according to an embodiment of the present invention.
Figs. 6A to 6C illustrate a simple example of a prunable block interleaver with
interleaver size N = 18 according to a conventional interleaving technique.
Figs. 7A and 7B illustrate a simple example of a prunable block interleaver with
interleaver size N = 18 according to the present invention.
Fig. 8 illustrates a block diagram of the turbo
decoding system of Fig. 5 in turbo decoding mode.
Fig. 9 illustrates a block diagram of the turbo
decoding system of Fig. 5 in Viterbi decoding mode.
Fig. 10 schematically shows an ACSA network and
related memory blocks of Fig. 8.
Fig. 11 shows an ACSA unit contained in the ACSA A section 1022 of Fig. 10 for
calculating a forward metric $A_k$.
Fig. 12 illustrates a detailed SIMD processor of Fig. 8.
DETAILED DESCRIPTION OF THE INVENTION



The present invention now will be described more fully hereinafter with reference
to the accompanying drawings, in which typical embodiments of the invention are
shown. This invention may, however, be embodied in many different forms and should
not be construed as limited to the embodiments set forth herein. Rather, these
embodiments are provided so that this disclosure will be thorough and complete, and will
fully convey the scope of the invention to those skilled in the art.

Before proceeding to describe the embodiments of the present invention, a typical
turbo coding system will be described with reference to Figs. 1 to 4 for better
understanding of the present invention.

Fig. 1 illustrates a basic turbo-encoding system 100 and Fig. 2 illustrates a typical
eight-state RSC encoder 102, 106 of Fig. 1.

The encoder of the turbo coding system consists of two constituent systematic
encoders 102, 106 joined together by means of an interleaver 104. Input data stream u is
applied directly to first encoder 102, and the interleaved version of the input data stream
u is applied to second encoder 106. The systematic bits (i.e., the original message bits)
and the two sets of parity bits $x^{p1}$ and $x^{p2}$ generated by the two encoders 102, 106
constitute the output of the turbo encoding system 100 and are combined by a
multiplexing means 108 to form a serial bit stream that is transmitted over the
communication channel. Before transmission, puncturing may be performed if necessary.
Each constituent encoder 102, 106 of the turbo encoding system is a recursive
systematic convolutional (RSC) code encoder, where one or more of the tap outputs in the
shift register are fed back to the input for obtaining better performance of the
overall turbo coding strategy.
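
The feedback structure of an RSC encoder can be sketched in a few lines of Python. The text does not give the tap positions of Fig. 2, so this sketch assumes the eight-state polynomials commonly used in 3G turbo codes (feedback $1 + D^2 + D^3$, feedforward $1 + D + D^3$); the function name `rsc_encode` is hypothetical.

```python
def rsc_encode(bits):
    """Eight-state RSC encoder sketch: returns (systematic, parity) bit lists.

    Assumed taps (not stated in the text): feedback 1 + D^2 + D^3,
    feedforward 1 + D + D^3.
    """
    s1 = s2 = s3 = 0                        # shift register, s1 is the newest stage
    parity = []
    for u in bits:
        feedback = u ^ s2 ^ s3              # taps D^2 and D^3 fed back to the input
        parity.append(feedback ^ s1 ^ s3)   # feedforward taps 1, D and D^3
        s1, s2, s3 = feedback, s1, s2       # shift by one position
    return list(bits), parity


systematic, parity = rsc_encode([1, 0, 0, 0, 0, 0, 0, 0])
```

Because of the feedback, a single 1 at the input keeps producing parity bits after the input has returned to 0, which is the recursive behaviour the paragraph above refers to.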

Fig. 3 shows a block diagram of the turbo decoding system 300. The turbo
decoder 300 operates on noisy versions of the systematic bits $y^{s}$ and of the
two sets of parity bits $y^{p1}$ and $y^{p2}$. The turbo decoding system 300 uses an iterative
decoding algorithm and consists of interleaver 304 and deinterleaver
308, 310 stages, individually matched to constituent decoding stages 302, 306. The
systematic bits $y^{s}$ and first set of parity bits $y^{p1}$ of turbo encoded data are applied to the
first SISO (Soft-Input-Soft-Output) decoder 302. Additionally, the deinterleaved version
of the metrics output from the second SISO decoder 306 is fed back to the first
SISO decoder 302. The metrics output from the first SISO decoder 302 are applied
to the second SISO decoder 306 via interleaver 304. The second set of parity bits $y^{p2}$ is
applied to the second SISO decoder 306. The output of the deinterleaver 310 is applied to
hard limiter 312, which outputs a bit stream of decoded data $\hat{u}$ corresponding to the
original raw data $u$.

As stated earlier, the metrics output from the deinterleaver 308 are fed back
to the metrics input of the first SISO decoder 302. Thus, the turbo decoding system 300
performs the nth decoding iteration with input metrics resulting from the (n-1)th decoding
iteration. The total number of iterations is either predetermined, or the
iterations stop if a certain stopping criterion is met.

Fig. 4 shows an extrinsic form of the turbo decoding system of Fig. 3, where I
stands for interleaver, D for deinterleaver, and SISO for soft-input soft-output decoder,
which may use a Log-MAP decoding algorithm, a Max-Log-MAP algorithm, etc.

The first decoding stage produces a soft estimate $\Lambda_1(u_k)$ of a systematic bit $u_k$
expressed as a log-likelihood ratio

$\Lambda_1(u_k) = \log \frac{P(u_k = 1 \mid y^{s}, y^{p1}, \Lambda_{2e}(\mathbf{u}))}{P(u_k = 0 \mid y^{s}, y^{p1}, \Lambda_{2e}(\mathbf{u}))}, \quad k = 1, 2, \ldots, N$   (1)

where $y^{s}$ is the set of noisy systematic bits, $y^{p1}$ is the set of noisy parity bits
generated by the first encoder 102, and $\Lambda_{2e}(\mathbf{u})$ is the extrinsic information about the set
of message bits $\mathbf{u}$ derived from the second decoding stage and fed back to the first stage.

Hence, the extrinsic information about the message bits derived from the first
decoding stage is

$\Lambda_{1e}(\mathbf{u}) = \Lambda_1(\mathbf{u}) - \Lambda_{2e}(\mathbf{u})$   (2)

where $\Lambda_{2e}(\mathbf{u})$ is to be defined.

Before application to the second decoding stage, the extrinsic information $\Lambda_{1e}(\mathbf{u})$
is reordered to compensate for the interleaving introduced in the turbo encoding system
100. In addition, the parity bits $y^{p2}$ generated by the second encoder 106 are used as
another input. Then, by using the BCJR algorithm, the second decoding stage produces a
more refined soft estimate of the message bits $\mathbf{u}$.

This estimate is de-interleaved to produce the total log-likelihood ratio $\Lambda_2(\mathbf{u})$.
The extrinsic information $\Lambda_{2e}(\mathbf{u})$ fed back to the first decoding stage is therefore

$\Lambda_{2e}(\mathbf{u}) = \Lambda_2(\mathbf{u}) - \Lambda_{1e}(\mathbf{u})$   (3)

where $\Lambda_{1e}(\mathbf{u})$ is itself defined by equation (2), and $\Lambda_2(\mathbf{u})$ is the log-
likelihood ratio computed by the second decoding stage. Specifically, for the kth element
of the vector $\mathbf{u}$, we have

$\Lambda_2(u_k) = \log \frac{P(u_k = 1 \mid y^{s}, y^{p2}, \Lambda_{1e}(\mathbf{u}))}{P(u_k = 0 \mid y^{s}, y^{p2}, \Lambda_{1e}(\mathbf{u}))}, \quad k = 1, 2, \ldots, N$   (4)

Through the application of $\Lambda_{2e}(\mathbf{u})$ to the first decoding stage, the feedback loop
around the pair of decoding stages is thereby closed. Note that although in actual fact the
set of noisy systematic bits $y^{s}$ is only applied to the first decoder 302 in Fig. 3, by
formulating the information flow in the symmetric extrinsic manner depicted in Fig. 4 we
find that $y^{s}$ is, in fact, also applied to the second decoding stage.

An estimate of message bits $\mathbf{u}$ is computed by hard-limiting the log-likelihood
ratio $\Lambda_2(\mathbf{u})$ at the output of the second decoding stage, as shown by
$\hat{\mathbf{u}} = \mathrm{sgn}(\Lambda_2(\mathbf{u}))$, where the signum function operates on each element of $\Lambda_2(\mathbf{u})$
individually. To initiate the turbo decoding algorithm, we simply set $\Lambda_{2e}(\mathbf{u}) = 0$ on the
first iteration of the algorithm.
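
The bookkeeping of equations (2) and (3) and the final hard decision can be sketched in Python. This is only an illustration of the subtraction and sign steps, not a decoder: the LLR values are made up, the function names `extrinsic` and `hard_limit` are hypothetical, and mapping a positive LLR to bit 1 is an assumption that follows the ratio P(u_k = 1)/P(u_k = 0) in equation (1).

```python
def extrinsic(total_llr, incoming_extrinsic):
    """Equations (2) and (3): remove the incoming extrinsic part from the total LLR."""
    return [t - e for t, e in zip(total_llr, incoming_extrinsic)]


def hard_limit(llr):
    """Signum-style decision: a positive LLR means bit 1 is the more likely value."""
    return [1 if value > 0 else 0 for value in llr]


# Made-up LLRs for a 4-bit block (illustrative values only).
lambda_2e = [0.0, 0.0, 0.0, 0.0]              # first iteration: no feedback yet
lambda_1 = [1.5, -0.5, 2.0, -3.0]             # total LLR of the first stage
lambda_1e = extrinsic(lambda_1, lambda_2e)    # equation (2)
lambda_2 = [2.5, -1.25, 3.0, -4.0]            # total LLR of the second stage
lambda_2e = extrinsic(lambda_2, lambda_1e)    # equation (3)
decoded = hard_limit(lambda_2)
```

In a real decoder `lambda_1` and `lambda_2` would come from the two SISO stages, and the interleaving between the stages is omitted here.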

Now a turbo decoding system of the present invention will be described.
Embodiments of the present invention provide a multi-standard turbo decoding system
with a processor on which a software interleaver is run. Particularly, the present
invention provides a turbo decoding system with a configurable hardware SISO decoder
and a single-instruction and multiple-data (SIMD) processor performing
flexible tasks such as interleaving. The software turbo interleaver is run on the
SIMD processor. The decoding system of the present invention can also support the
Viterbi decoding algorithm as well as a turbo decoding algorithm such as the Log-MAP
algorithm and the Max-Log-MAP algorithm.



The processor generates interleaved addresses for the turbo decoder, supporting
multiple 3G wireless standards at the speed of the hardware SISO decoder and
changing the interleaver structure (i.e., frame size and bit rate) in a very short time with a
small memory. To hide the timing overhead of changing the interleaver, the interleaved
address generation is split into two parts, pre-processing and incremental on-the-fly
generation. The pre-processing part prepares a small number of seed
variables and the incremental on-the-fly generation part generates interleaved
addresses based on the seed variables on the fly. When the bit rate changes, the processor
carries out only the pre-processing part to prepare a small number of seed variables, hence
requiring a short time and a small memory. Whenever the interleaved address sequence is
required, the processor generates the interleaved addresses using the seed
variables. This splitting method reduces the timing overhead of the interleaver and
requires only a small memory to save the seed variables.

Fig. 5 schematically shows a block diagram of a time-multiplex turbo decoding
system according to an embodiment of the present invention.

Turbo decoding system 500 of the present invention comprises an SISO decoder
502, a processor 504 on which a software interleaver is run, and a $\Lambda_e$ memory 506 for storing
an extrinsic log-likelihood ratio (LLR) as illustrated in Fig. 5. Data are sequentially
stored in the $\Lambda_e$ memory 506 as they are always read and written in-place. For each iteration,
data are accessed in a sequential order for the odd-numbered SISO decoding and in an
interleaved order for the even-numbered SISO decoding. Namely, in the odd-numbered
(first) SISO decoding, SISO decoder 502 receives data in a sequential order from the
$\Lambda_e$ memory 506 and calculates a log-likelihood ratio, and the log-likelihood ratio is
written into the $\Lambda_e$ memory 506 in the sequential order. In the even-numbered (second)
SISO decoding, SISO decoder 502 receives data in an interleaved order from the
$\Lambda_e$ memory 506 and calculates a new log-likelihood ratio, and the new log-likelihood
ratio is deinterleaved with the help of the address queue 508 and written into the
$\Lambda_e$ memory 506 in the sequential order. As shown with the dotted lines, the processor 504
provides interleaved addresses to read data in an interleaved order. The address queue 508
saves the addresses in an interleaved order so that the addresses can be used as the write
addresses for the $\Lambda_e$ memory 506 when the SISO decoder 502 produces results after its
latency. Accordingly, data in an interleaved order can be de-interleaved into the
sequential order, and saved in the $\Lambda_e$ memory 506 in the sequential order.
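
The role of the address queue in the even-numbered SISO decoding can be illustrated with a small Python model. It is a sketch under stated assumptions: the decoder is replaced by an arbitrary per-sample function `siso`, its latency is modelled as a fixed number of steps, and the names `interleaved_half_iteration`, `siso` and `latency` are hypothetical.

```python
from collections import deque


def interleaved_half_iteration(memory, addresses, siso, latency=3):
    """Read `memory` in interleaved order and write the results back in place.

    Each read address is pushed into a FIFO. When the simulated decoder
    result emerges `latency` steps later, the oldest queued address is
    popped and reused as the write address.
    """
    address_queue = deque()
    pipeline = deque()                     # results still "inside" the decoder
    for addr in addresses:
        address_queue.append(addr)
        pipeline.append(siso(memory[addr]))
        if len(pipeline) > latency:
            memory[address_queue.popleft()] = pipeline.popleft()
    while pipeline:                        # drain the decoder after the last read
        memory[address_queue.popleft()] = pipeline.popleft()
    return memory


addresses = [17, 1, 13, 7, 2, 11, 9, 16, 3, 14, 6, 4, 12, 8, 15, 0, 10, 5]
llr = [float(k) for k in range(18)]
updated = interleaved_half_iteration(list(llr), addresses, lambda v: v + 100.0)
```

Since every address is read exactly once before it is written, the in-place update leaves each result at the sequential position it belongs to, which is what makes a separate deinterleaver memory unnecessary in this model.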

In addition to interleaving, the processor 504 can control the hardware blocks or
interface with an external host, and processes the trellis termination and a stopping
criterion during the first SISO decoding, which does not need interleaved addresses. In a
Viterbi decoding mode, SISO decoder 502 is repeatedly used for the Viterbi recursion.
The $\Lambda_e$ memory 506 plays the role of the traceback memory. The processor 504
performs the traceback from the traceback information read from the $\Lambda_e$
memory 506.

A flexible software turbo interleaver that runs on the processor for the turbo decoder will
now be described. To hide the timing overhead of an interleaver change, the interleaved
address generation is split into two parts, pre-processing and incremental on-the-fly
generation. The pre-processing part prepares a small number of seed
variables and the incremental on-the-fly generation part generates interleaved
addresses based on the seed variables on the fly. When the interleaver size changes due to
a change of bit rate or the communication standard itself, only the pre-processing part is
carried out, and it prepares a small number of seed variables, not the whole interleaved
address sequence. Through parallel processing using the seed variables, the processor
generates interleaved addresses as fast as the hardware SISO decoding rate whenever the
interleaved address sequence is required. The unit of on-the-fly address generation is a
column of a block interleaver. Interleaved addresses are generated column by column,
used as read addresses and stored in the address queue, and then reused as write
addresses for deinterleaving.

Before proceeding to describe interleaving techniques according to the
embodiment of the present invention, conventional interleaving techniques will be
described for better understanding of the present invention.

The turbo decoders for wireless systems are based on block turbo interleavers.
Although the operations and parameters vary depending on the standards, W-CDMA and
cdma2000 share the prunable block interleaver structure, where the
interleaver is implemented by building a mother interleaver of a predefined size and then
pruning unnecessary addresses. The mother interleavers can be viewed as two-
dimensional matrices where the entries are written in the matrix row by row and
read out column by column. Before reading out the entries, intra- and inter-row
permutations are performed.

Figs. 6A to 6C illustrate a simple example of a typical prunable block interleaver
with interleaver size N = 18, where the data indexes are written in a matrix form.
Incoming data are written row by row in a two-dimensional matrix
memory as shown in Fig. 6A. Fig. 6B shows intra-row permutation data
indexes from Fig. 6A. The intra-row permutation rule applied to this example is

$y_{ij} = b_i + [(j+1) \cdot q_i] \bmod 5$   (5),

where $y_{ij}$ is the permuted index of the ith row and jth column, $i$ and $j$ are row and
column indexes, respectively, $\mathbf{b} = (b_0, b_1, b_2, b_3) = (0, 5, 10, 15)$, and
$\mathbf{q} = (q_0, q_1, q_2, q_3) = (1, 2, 3, 7)$. Fig. 6C shows the inter-row permutation result
from Fig. 6B, which will be read out from the memory column by column as a sequence of
17, 1, 13, 7, 2, 11, 9, ..., 0, 10, 5. The indexes 19, 18 exceeding the range of interest
are pruned.
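
The conventional scheme can be reproduced with a short Python sketch that builds the whole permuted matrix from equation (5) and then reads it out with pruning. The function name `conventional_interleave` is hypothetical, and the row read-out order (3, 0, 2, 1) is inferred from the reordered base vector (15, 0, 10, 5) given for Fig. 7A rather than stated for Fig. 6C directly.

```python
def conventional_interleave(n, rows, cols, b, q, row_order):
    """Return the pruned read-out sequence of the mother interleaver."""
    # Intra-row permutation, equation (5): y_ij = b_i + [(j + 1) * q_i] mod cols
    matrix = [[b[i] + ((j + 1) * q[i]) % cols for j in range(cols)]
              for i in range(rows)]
    sequence = []
    for j in range(cols):                  # read out column by column
        for i in row_order:                # inter-row permutation
            if matrix[i][j] < n:           # prune indexes outside 0..n-1
                sequence.append(matrix[i][j])
    return sequence


addresses = conventional_interleave(
    n=18, rows=4, cols=5, b=(0, 5, 10, 15), q=(1, 2, 3, 7),
    row_order=(3, 0, 2, 1))
```

Note that this approach needs a multiplication and a modulo per entry and materialises the full matrix, which is the cost the seed-variable method avoids.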

Now an interleaving technique according to an embodiment of the present
invention will be described with the example of Figs. 6A to 6C and with reference to Figs.
7A and 7B.

The present embodiment uses an increment vector $\mathbf{w}$ of $w_i = q_i \bmod 5$ instead of
$\mathbf{q}$, and a cumulative vector $\mathbf{x}_j$ of

$x_{ij} = [(j+1) \cdot q_i] \bmod 5$   (6).

Using equation (6), equation (5) can be rewritten as

$y_{ij} = b_i + x_{ij}$   (7)

and $x_{ij}$ can be obtained recursively as

$x_{ij} = [(j+1) \cdot (q_i \bmod 5)] \bmod 5$
$\quad\; = [(j+1) \cdot w_i] \bmod 5$
$\quad\; = [(j \cdot w_i \bmod 5) + w_i] \bmod 5$
$\quad\; = (x_{i,j-1} + w_i) \bmod 5$,   (8)

where $j = 1, 2, 3, 4$ and $x_{i0} = w_i$.

As $0 \le x_{i,j-1} < 5$ and $0 \le w_i < 5$, $0 \le x_{i,j-1} + w_i < 10$ and thus

$x_{ij} = \begin{cases} x_{i,j-1} + w_i - 5 & \text{if } x_{i,j-1} + w_i \ge 5, \\ x_{i,j-1} + w_i & \text{otherwise} \end{cases}$   (9)

According to the embodiments of the present invention, multiplication and
modulo operations of equation (6) are replaced by cheaper operations: the
multiplication by an addition and the modulo by a comparison and subtract operation.
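
A quick way to convince oneself that the add-compare-subtract recursion gives the same values as the multiply-and-modulo form is to compute both for the example increments. The following Python sketch does that; the function names are hypothetical.

```python
def cumulative_by_recursion(w_i, modulo, cols):
    """x_i0 = w_i, then x_ij = x_i,j-1 + w_i, reduced by `modulo` when the sum reaches it."""
    x = w_i
    values = [x]
    for _ in range(1, cols):
        x += w_i                 # addition replaces the multiplication
        if x >= modulo:          # comparison ...
            x -= modulo          # ... and subtraction replace the modulo
        values.append(x)
    return values


def cumulative_direct(q_i, modulo, cols):
    """Direct form: x_ij = [(j + 1) * q_i] mod `modulo`."""
    return [((j + 1) * q_i) % modulo for j in range(cols)]


q = (1, 2, 3, 7)
recursive = [cumulative_by_recursion(q_i % 5, 5, 5) for q_i in q]
direct = [cumulative_direct(q_i, 5, 5) for q_i in q]
```

The single subtraction is sufficient only because both operands are already below the modulo base, so their sum stays below twice that base.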
As shown in Fig. 7A, $\mathbf{b}$, $\mathbf{w}$, and $\mathbf{x}_0$ for the first column of the block interleaver are
calculated and stored in vector registers of the processor in the preprocessing stage.
The number of elements of each vector $\mathbf{b}$, $\mathbf{w}$, and $\mathbf{x}_0$ corresponds to the
number of elements of a column. The right side of Fig. 7A shows that $\mathbf{b}$, $\mathbf{w}$, and $\mathbf{x}_0$ are
stored in the order of inter-row permutation, such as $\mathbf{b} = (b_3, b_0, b_2, b_1) = (15, 0, 10, 5)$ and
$\mathbf{w} = (w_3, w_0, w_2, w_1) = (2, 1, 3, 2) = \mathbf{x}_0$, in advance so as to free the on-the-fly
generation from the inter-row permutation.

In the column by column on-the-fly address generation stage shown in Fig. 7B,
the processor updates $\mathbf{x}_j$ according to equation (9) and calculates the addresses based on
equation (7). Calculated addresses are sent to the address queue if they are smaller than
the interleaver size N.

Referring to Fig. 7B, interleaved addresses for the first column ($\mathbf{y}_0$) are calculated by
adding $\mathbf{b} + \mathbf{x}_0$. Since $\mathbf{x}_0$ is $\mathbf{w}$, $\mathbf{y}_0$ is calculated as (15, 0, 10, 5) + (2, 1, 3, 2) =
(17, 1, 13, 7). After the interleaved addresses for the first column are calculated, $\mathbf{x}_1$ is
obtained by adding $\mathbf{x}_0$ = (2, 1, 3, 2) and $\mathbf{w}$ = (2, 1, 3, 2). Thereby $\mathbf{x}_1$ is set to (4, 2, 1, 4),
where the third element of $\mathbf{x}_1$ is 1 because 3 + 3 = 6 is not less than 5 and thus 5 is subtracted.
Accordingly, interleaved addresses for the second column ($\mathbf{y}_1$) are calculated by
adding $\mathbf{b} + \mathbf{x}_1$ = (15, 0, 10, 5) + (4, 2, 1, 4) = (19, 2, 11, 9), where the first element 19 is
discarded since it is larger than or equal to the size N (= 18).
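
The column-by-column procedure of Fig. 7B can be sketched as a Python generator that keeps only the seed vectors in memory. This is an illustration under the example's parameters, not the processor code; the function name `on_the_fly_addresses` is hypothetical.

```python
def on_the_fly_addresses(n, cols, b, w, modulo):
    """Yield pruned interleaved addresses column by column from seed vectors."""
    x = list(w)                                  # x_0 = w
    for _ in range(cols):
        for base, cumulative in zip(b, x):       # equation (7): y = b + x
            address = base + cumulative
            if address < n:                      # pruning
                yield address
        # equation (9): add, then compare and subtract
        x = [xi + wi - modulo if xi + wi >= modulo else xi + wi
             for xi, wi in zip(x, w)]


addresses = list(on_the_fly_addresses(
    n=18, cols=5, b=(15, 0, 10, 5), w=(2, 1, 3, 2), modulo=5))
```

Only the three four-element vectors are stored here, compared with the 20-entry matrix of the conventional method; the saving grows with the interleaver size.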

As described above, the present invention requires only a small memory for storing
the seed variables and performs add and subtract operations instead of modulo
and multiplication operations.

The above example is a very simple one, and the turbo interleavers in the real
world, such as the W-CDMA and cdma2000 standard turbo interleavers, are much more
complex. However, the fundamental structure and the basic operations used in
permutation rules are the same. The pseudocodes of the on-the-fly address generation for
W-CDMA and cdma2000 are shown in Table 1 and Table 2, respectively.

In Table 1, C is the number of columns, R is the number of rows, p is a prime
number, and s(x) is a permutation sequence, which are determined from the interleaver
size N according to the specification of the W-CDMA standard. Likewise C, R and the
binary power $2^n$ in Table 2 are determined from N according to the cdma2000
specification. The present invention handles those values also as seed variables and
calculates them in advance at the preprocessing stage.

The on-the-fly generation flows of these real-world turbo interleavers are similar
to the example. They also have a base column vector b, an increment column vector w, a
cumulative column vector x, and a certain modulo base. x is updated by adding w to the
old x value. If elements of the updated x are larger than or equal to the modulo base,
those elements of the updated cumulative column vector are reduced by the
modulo base. This operation substitutes for a computationally expensive
modulo operation. Then the interleaved addresses for one column are generated by
adding b and a vector that is calculated from x.

Table 1

0   column_counter = C-1
loop:
1   x = x + w
2   for each (i = 0, 1, ..., R-1) if (x_i >= p-1) x_i = x_i - (p-1)
3   load s(x) from the data memory
4   y = b + s(x)
5   for each (i = 0, 1, ..., R-1) if (y_i < N) send y_i to the address queue
6   if ((column_counter--) != 0) goto loop

Table 2

0   column_counter = C-1
loop:
1   x = x + w
2   for each (i = 0, 1, ..., R-1) if (x_i >= 2^n) x_i = x_i - 2^n
3   y = b + x
4   for each (i = 0, 1, ..., R-1) if (y_i < N) send y_i to the address queue
5   if ((column_counter--) != 0) goto loop



A SIMD (single-instruction and multiple-data) processor is suitable for this
operation because the column is a vector and all the column entries go through
exactly the same operations. However, in order to generate one address at every one or
two cycles, some special instructions can be used to make the long program short.

To speed up address generation, customized instructions can be used to
reduce the length of the loop in the on-the-fly generation part. The present invention
introduces three processor instructions: STOLT (store to output port if less than),
SUBGE (subtract if greater or equal), and LOOP. Each of these instructions
substitutes for a sequence of three ordinary instructions but takes only one clock cycle to
execute. For example, instruction STOLT corresponds to three RISC instructions,
namely SUB x, y, z; BRANCH if z >= 0; STO x. Likewise, SIMD instruction SUBGE
corresponds to three RISC instructions, namely SUB x, y, z; BRANCH
if z < 0; MOVE z, x.

Pruning can be mapped to STOLT. The function of STOLT is to send the
calculated interleaved address to the address queue only if the calculated interleaved
address is smaller than N, which is needed for the pruning as in line 5 of the pseudocode
of Table 1 and line 4 of Table 2.

Another conditional instruction, SUBGE, is quite useful for the block interleavers
that commonly use modulo operations. Instruction SUBGE substitutes for a
modulo or remainder operation a mod b if the condition $0 \le a < 2b$ is satisfied, which
corresponds to (9) and line 2 of Table 1 and Table 2.
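
The semantics of the two conditional instructions can be modelled on scalars in Python, which also makes the range restriction of SUBGE easy to check. The functions below are hypothetical software models of the described behaviour, not the processor's instruction set definition.

```python
def subge(a, b):
    """SUBGE model: subtract b if a >= b, otherwise pass a through unchanged."""
    return a - b if a >= b else a


def stolt(value, limit, output_port):
    """STOLT model: append value to the output port only if value < limit."""
    if value < limit:
        output_port.append(value)


# SUBGE equals a mod b over the whole range 0 <= a < 2b ...
modulo_matches = all(subge(a, b) == a % b
                     for b in range(1, 64) for a in range(2 * b))
# ... but not outside it (here a = 2b).
fails_outside = subge(10, 5) != 10 % 5

queue = []
for address in (19, 2, 11, 9):       # second column of the N = 18 example
    stolt(address, 18, queue)
```

The exhaustive check over small values shows why the seed vectors must be kept below the modulo base before each addition: one subtraction is then always enough.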

Adopted in several DSP processors to reduce the loop overhead of the address
generation, the LOOP instruction is also helpful in our application. The LOOP
instruction corresponds to a sequence of CMP, BNE (branch if not equal), and SUB
instructions, which at once decrements the loop count and branches.

Using these special instructions, the present invention can reduce the lengths of
the on-the-fly generation program loops of W-CDMA, cdma2000, and CCSDS to six, five,
and four instructions, respectively. Using a loop-unrolling technique, the present
invention can further shorten the loop length of the on-the-fly address generation
parts by almost one instruction.

In the turbo interleaver pseudocodes of Table 1 and Table 2, each line
corresponds to an instruction of the SIMD processor code. In Table 1, line 2
corresponds to SUBGE, line 5 to STOLT, and line 6 to LOOP. The SUBGE
safely substitutes for $x_i = x_i \bmod (p-1)$ because the condition $0 \le x_i < 2(p-1)$ is satisfied
($0 \le x_i < p-1$ and $0 \le w_i < p-1$ before they are added). If R = 10 or 20 and the processor can
process five data at once, lines 1-5 are repeated two or four times to produce an
entire column of the interleaver matrix. Similarly, in Table 2 line 2 corresponds to
SUBGE, line 4 to STOLT, and line 5 to LOOP.
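
How a column longer than the SIMD width is handled in several passes can be sketched in Python. This is a simplified model: the s(x) table lookup of Table 1 is replaced by the identity (so it follows Table 2 more closely), the seed values are made up, and the function name `column_in_passes` and the parameter `lanes` are hypothetical.

```python
def column_in_passes(b, x, w, modulo, n, lanes=5):
    """Process one R-element column in passes of `lanes` elements each.

    `x` is updated in place so that it is ready for the next column.
    Returns the pruned addresses of this column and the number of passes.
    """
    queue = []
    passes = 0
    for start in range(0, len(b), lanes):
        stop = start + lanes
        # x = x + w, then SUBGE against the modulo base
        summed = [xi + wi for xi, wi in zip(x[start:stop], w[start:stop])]
        reduced = [v - modulo if v >= modulo else v for v in summed]
        x[start:stop] = reduced
        # y = b + x, then STOLT-style pruning against n
        queue.extend(bi + xi for bi, xi in zip(b[start:stop], reduced)
                     if bi + xi < n)
        passes += 1
    return queue, passes


queue, passes = column_in_passes(
    b=[10 * i for i in range(10)], x=[0] * 10, w=[3] * 10, modulo=7, n=95)
```

With ten rows and five lanes the loop body runs twice per column, and with twenty rows four times, matching the repetition counts mentioned above.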

Fig. 8 illustrates a block diagram of a decoding system which can support turbo
decoding and Viterbi decoding. Fig. 8 schematically shows data flow and address flow in
a turbo decoding mode. In Fig. 8, a solid line indicates data flow and a dotted line
indicates address flow. Fig. 9 schematically shows data flow and address flow in a
Viterbi decoding mode.

Referring to Fig. 8, a turbo decoding system of the present invention
comprises a SISO decoder 810, a processor block 830, a buffer memory block 850
and an address queue 870. The processor block 830 includes a SIMD
(single-instruction and multiple-data) processor 832, an instruction memory
834, and a data memory 836. The SISO decoder 810 implements a typical
sliding-window decoder. It includes an ACSA (add-compare-select-add) network
812 and a plurality of memory blocks: Γ1 memory 814, Γ2 memory 816, Γ3 memory
818, A memory 820 and a hard decision memory 822. The buffer memory block 850
includes a Λ_e memory 852 for storing an extrinsic log-likelihood ratio (LLR)
and a plurality of memories for storing soft input data: y^s memory 854 for
storing noisy systematic bits multiplied by the channel reliability 862, and
y^p1 ~ y^p3 memories 856, 858, 860 for storing parity bits multiplied by the
channel reliability 862.

The ACSA network 812 calculates forward metrics A_k, backward metrics B_k, and
an extrinsic log-likelihood ratio (LLR). The memory blocks Γ1 memory 814, Γ2
memory 816, and Γ3 memory 818 store input data, and the A memory 820
temporarily stores the calculated forward metrics. A hard decision output from
the ACSA network 812 is stored in the hard decision memory 822. The SIMD
processor 832 also calculates a stopping criterion during SISO decoding from
information stored in the hard decision memory 822. Input data are read into
one of the memory blocks, Γ1 memory 814, Γ2 memory 816, and Γ3 memory 818, and
are used three times for calculating the forward metric A_k, the backward
metric B_k, and the LLR Λ(u_k).

The software interleaver is run on the SIMD processor 832. As described
earlier, the SIMD processor 832 generates interleaved addresses column by
column and the address queue 870 saves the interleaved addresses. When the
SIMD processor 832 calculates interleaved read addresses, the address queue
870, whose length is the SISO latency, saves the interleaved addresses in
order to use them again as the write addresses into the Λ_e memory 852.
Namely, when the ACSA network 812 produces results using the data read from
the interleaved addresses, the results are stored into the corresponding place
of the Λ_e memory 852 with the write address stored in the address queue 870.
In addition to the interleaving, the SIMD processor 832 controls the hardware
blocks, interfaces with an external host, and processes the trellis
termination and a stopping criterion during SISO decoding that does not need
an interleaver.
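The reuse of read addresses as write addresses can be sketched as a FIFO whose
depth equals the SISO latency. This is a minimal Python illustration under
assumptions: the function name `run_half_iteration`, the latency value, the
toy permutation, and the stand-in for the SISO computation (negating each
value) are all hypothetical.

```python
from collections import deque


def run_half_iteration(llr_in, interleaved_addrs, siso_latency):
    """Read from interleaved addresses, write results back to the same addresses.

    The SISO decoder is modelled as a pure delay line of `siso_latency` steps
    that negates each value (a stand-in for the real computation).
    """
    llr_out = [None] * len(llr_in)
    addr_queue = deque()   # models the address queue
    pipeline = deque()     # models results in flight inside the SISO decoder
    steps = len(interleaved_addrs) + siso_latency
    for t in range(steps):
        if t < len(interleaved_addrs):
            addr = interleaved_addrs[t]
            addr_queue.append(addr)          # save the read address
            pipeline.append(-llr_in[addr])   # value enters the decoder
        if t >= siso_latency:
            # A result emerges after the latency; the oldest queued address
            # is the location it was read from, so it is the write address.
            llr_out[addr_queue.popleft()] = pipeline.popleft()
        # The queue never needs more entries than the latency.
        assert len(addr_queue) <= siso_latency
    return llr_out


llr = [10, 20, 30, 40, 50]
perm = [3, 0, 4, 1, 2]  # hypothetical interleaver addresses
result = run_half_iteration(llr, perm, siso_latency=2)
```

Each result lands at the address it was read from, so the interleaved
addresses are computed only once per access.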

Since the Viterbi algorithm does not calculate backward metrics in the Viterbi
decoding mode, some components of Fig. 8 are unused as shown in Fig. 9. In
Fig. 9, components illustrated by a dotted line such as an address queue 970,
a channel reliability multiplication 962, and three Γ memory blocks 914, 916,
918 are not used in the Viterbi decoding mode. The Λ_e memory 852 of Fig. 8
serves as a traceback memory 952 and the A memory 820 of Fig. 8 serves as a
path metric memory. The SIMD processor 932 processes the traceback from the
traceback information read from the traceback memory 952. The SISO network 912
is used for the Viterbi forward trellis recursion.
Fig. 10 schematically shows a detailed view of the ACSA network and related
memory blocks of Fig. 8. The SISO decoder 1000 in Fig. 10 implements a sliding
window turbo decoding technique in hardware. The ACSA network 1010 includes
multiplexers 1012, 1014, 1016, 1018, 1020 and ACSA A 1022, two ACSA B 1024,
1026, and ACSA Λ 1028. ACSA A 1022 and ACSA B 1024 comprise eight ACSA units
and ACSA Λ 1028 comprises fourteen ACSA units. An exemplary ACSA unit of
ACSA A 1022 is illustrated in Fig. 11.

After input data sequences are stored in the Γ1 memory 1050 and Γ2 memory
1070, SISO decoding starts while new interleaved data sequences are stored in
the Γ3 memory 1090.

To implement the sliding window, first a window size of L input data are
stored in the Γ memories 1050, 1070, 1090. According to the MAP algorithm,
this block of written values is read three times. These three operations are
performed in parallel in each ACSA section in Fig. 10. The ACSA B section 1026
implements a sliding window algorithm, which requires a dummy backward
recursion of depth L. In order to avoid the use of a multiple-port memory,
three separate Γ memories of depth L are used so that the defined operations
operate on them in a cyclic way. The A memory 1030 temporarily stores the
calculated A_k(s)'s.

The SISO outputs are obtained in the reversed order, but the correct order can
be restored by properly changing the interleaving rules of the decoder.

To support multiple standards, configurable ACSA units can be employed in the
SISO decoder in Fig. 10. The ACSA unit shown in Fig. 11 can be adapted to
various RSC codes of different constraint length K and different coding rates
of 1/2 to 1/5.

The Log-MAP algorithm outperforms the Max-Log-MAP algorithm if the channel
noise is properly estimated. However, it is reported that the Max-Log-MAP is
more tolerant to the channel estimation error than the Log-MAP algorithm. Thus
the present invention provides the ACSA unit that can be selected to use the
Log-MAP or the Max-Log-MAP algorithm.

Fig. 11 shows an ACSA unit contained in the ACSA A section 1022 for
calculating forward metric A_k(s) of Fig. 10.

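Forward metric A_k(s) is calculated by equation (10):

$$A_k(s) = \ln \sum_{s'} \exp\left[A_{k-1}(s') + \Gamma_k(s', s)\right] \approx \max_{s'}\left(A_{k-1}(s') + \Gamma_k(s', s)\right) \qquad (10)$$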

where A_{k-1}(s') is the forward metric of state s' at the previous time stamp
k-1 and Γ_k(s', s) is the logarithmic branch probability that the trellis
state changes from s' to s at time stamp k. The ACSA unit 1100 includes input
multiplexers 1101, 1103, 1105, 1107, 1109, 1111, two adder blocks (CSA) 1113,
1115, two input adders 1117, 1119, a comparator 1121, a lookup table 1123, an
output multiplexer 1125, and an output adder 1127.

Two adder blocks (CSA) 1113, 1115 calculate branch probabilities Γ_k(s', s)
given by equation (11)




where Λ_e(u_k) is the extrinsic LLR information from the previous SISO
decoding and L_c is the channel reliability. Input data Λ_e(u_k) + L_c y_k^s
and L_c y_k^p are selected by multiplexers 1101, 1103, 1105, 1107, 1109, 1111,
which can change the coding rate and the transfer function. The input adder
1117 adds the output branch probability Γ_k(s0', s) of the adder block 1113
and incoming data A_{k-1}(s0'). The input adder 1119 adds the output branch
probability Γ_k(s1', s) of the adder block 1115 and incoming data
A_{k-1}(s1'). The comparator 1121 receives two inputs from the input adders
1117, 1119 and outputs a max value of the two inputs and a differential value
between the two inputs. The differential value is used to look up the table
1123 that stores approximation offsets of (10), and the max value is
transferred to the output adder 1127. The output multiplexer 1125 selects the
decoding algorithm, Log-MAP or Max-Log-MAP. If 0 is selected at the
multiplexer 1125, the output A_k(s) of the adder 1127 is given as
A_k(s) = max(A_{k-1}(s') + Γ_k(s', s)), which corresponds to the Max-Log-MAP
algorithm. Otherwise, if the lookup table value is selected, the output A_k(s)
of the adder 1127 is given as A_k(s) = ln(Σ_{s'} exp[A_{k-1}(s') + Γ_k(s', s)]),
which is compensated by the offset read from the lookup table and corresponds
to the Log-MAP algorithm.
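The selectable compensation can be illustrated with the two-input case, where
the identity ln(e^a + e^b) = max(a, b) + ln(1 + e^(-|a - b|)) holds. The
Python sketch below is a model under assumptions, not the circuit of Fig. 11:
the function name `acsa`, the table step of 0.5, and the table length of 8
entries are hypothetical choices made only for illustration.

```python
import math

STEP = 0.5      # assumed quantization step of the difference
ENTRIES = 8     # assumed number of table entries
# Offset table: ln(1 + exp(-d)) sampled at d = 0, STEP, 2*STEP, ...
OFFSET_TABLE = [math.log1p(math.exp(-i * STEP)) for i in range(ENTRIES)]


def acsa(a0, g0, a1, g1, use_log_map):
    """Add-compare-select-add for one state with two incoming branches.

    a0, a1: previous forward metrics of the two predecessor states
    g0, g1: branch metrics of the two incoming branches
    use_log_map: True selects the table offset (Log-MAP),
                 False selects 0 (Max-Log-MAP)
    """
    s0 = a0 + g0                      # first input adder
    s1 = a1 + g1                      # second input adder
    best = max(s0, s1)                # comparator: max value
    diff = abs(s0 - s1)               # comparator: differential value
    index = int(diff / STEP)          # table address from the difference
    offset = OFFSET_TABLE[index] if index < ENTRIES else 0.0
    return best + (offset if use_log_map else 0.0)   # output mux and adder


exact = math.log(math.exp(2.0 + 0.5) + math.exp(0.5 + 1.0))
max_log = acsa(2.0, 0.5, 0.5, 1.0, use_log_map=False)
log_map = acsa(2.0, 0.5, 0.5, 1.0, use_log_map=True)
```

In this example the difference falls exactly on a table sample, so the
compensated result matches the exact logarithm; for differences between
samples the table value is only an approximation.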

In conventional hardware SISO decoders, the calculation of Γ_k(s', s) is
fixed, as all the x_k's are fixed for the target turbo code. However, the
present invention can change the x_k values by configuring the input
multiplexers in Fig. 11. This allows the change of coding rate and generator
polynomials of the RSC encoder. The ACSA unit in the figure can support rate
1/2 to 1/5 turbo codes with arbitrary generator polynomials. To support a
lower coding rate, the input to the Γ_k(s', s) calculation logic should be
increased. To support multiple constraint lengths, the number of ACSA units
and the interconnection between the units can be changed. As mentioned above,
the multiplexer 1125 on the right determines the decoding algorithm: Log-MAP
or Max-Log-MAP. If a reliable channel estimation is obtained from an external
host which calculates the channel estimation, for example, using a power
control bit of the 3G communication systems, we can obtain better performance
with the Log-MAP algorithm by setting the multiplexer to pass the look-up
table value. On the other hand, if nothing is known about the channel, the
Max-Log-MAP is used to avoid error due to channel misestimation by passing 0
to the final adder of the ACSA unit.

To keep pace with the hardware SISO described with reference to Figs. 10 and
11, it is preferable that interleaved address generation be performed by
parallel processing. The present invention employs SIMD architecture because
it is suitable for the simple and repetitive address generation and has
simpler control and lower power consumption than a superscalar or very long
instruction word (VLIW) processor. Fig. 12 illustrates the detailed SIMD
processor of Fig. 8. In Fig. 12, a dotted line indicates control flow and a
solid line indicates data flow. Considering that the number of rows of the
W-CDMA block interleaver is a multiple of five, the SIMD processor 1200
includes five processing elements PE0 ~ PE4. The bit widths of instructions
and data are 16. The first processing element PE0 (1201) controls the other
four processing elements PE1 ~ PE4 and in addition it processes a scalar
operation. The first processing element PE0 1201 fetches, decodes, and
executes instructions including control and multi-cycle scalar instructions
while the other four processing elements PE1 ~ PE4 only execute SIMD
instructions. An instruction corresponding to the program counter (PC) 1210 is
fetched from the instruction memory 1227 and temporarily stored into the
instruction register (IR) 1213.
The fetched instruction is decoded by the decoder 1214 and then the decoded
instruction is executed accordingly in each of the processing elements
PE0 ~ PE4; for example, each ALU 1221 executes add operations. After execution
of the instruction is completed, the program counter (PC) 1210 is incremented
by the PC controller 1211 and a new instruction is fetched from the
instruction memory 1227. The other four processing elements PE1 (1203), PE2
(1205), PE3 (1207), and PE4 (1209) execute SIMD instructions. All the
processing elements PE0 ~ PE4 include a register block 1215 for storing data
for parallel operations. The register block 1215 includes vector registers
VR0 ~ VR15 (1217). The register block 1215 of the first processing element PE0
also includes additional scalar registers 1219 to store scalar and control
data. The second to fifth processing elements PE1 ~ PE4 include a register
1225 for temporarily storing the encoded instruction. The SIMD instruction is
not executed in all processing elements at the same time, but executed in one
processing element after another so that a data memory port and an I/O port
can be shared in a time-multiplexed fashion, which saves memory access power
and provides a simple I/O interface.
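The staggered execution can be pictured with a small Python sketch. It
assumes, purely for illustration, that each processing element starts a given
SIMD instruction one cycle after its predecessor and that a memory access
occupies the shared port for one cycle; neither timing detail is stated above,
and the function name `port_schedule` is hypothetical.

```python
NUM_PE = 5  # PE0 ~ PE4


def port_schedule(issue_cycle, num_pe=NUM_PE):
    """Return {cycle: pe} for one SIMD memory instruction issued at issue_cycle.

    Assumption: PE i executes the instruction at issue_cycle + i, so in each
    cycle exactly one PE uses the shared data memory port.
    """
    return {issue_cycle + pe: pe for pe in range(num_pe)}


schedule = port_schedule(issue_cycle=10)
for cycle in sorted(schedule):
    print(f"cycle {cycle}: PE{schedule[cycle]} uses the memory port")
```

Under this assumption a single-port memory suffices for one instruction,
because no two processing elements request the port in the same cycle.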

As mentioned before, specialized SIMD processor instructions, STOLT, SUBGE,
and LOOP, are employed to replace common, frequent instruction sequences of
three typical RISC instructions appearing in turbo interleaver programs. STOLT
and SUBGE are SIMD instructions, whereas LOOP is a scalar control instruction,
which is executed only in PE0.

According to at least one embodiment of the present invention, the two stages
of the block turbo interleaver, preprocessing and on-the-fly generation, are
performed in the SIMD processor. The speed of on-the-fly generation
dramatically improves using the SIMD processor because of its SIMD parallel
processing capability and its support of three special instructions.

It should be noted that many variations and modifications may be made to the
embodiments described above without substantially departing from the
principles of the present invention. All such variations and modifications are
intended to be included herein within the scope of the present invention, as
set forth in the following claims.
TURBO DECODER AND TURBO INTERLEAVER



ABSTRACT OF THE DISCLOSURE 

A processor on which a software interleaver is run is provided. Interleaver
generation is split into two parts to reduce the overhead time of interleaver
changing. First, preprocessing prepares seed variables, requiring a small
memory. Second, on-the-fly address generation generates interleaved addresses
through simple adding and subtracting operations using the seed variables.